Genome-Wide Association Analysis (GWAS) - Concepts and Hands-on

Felipe Ferrão
University of Florida -- Oct 23, 2018

Content

Objectives

  • Present intuitive examples on GWAS analysis

  • Discuss R codes and models

  • Mixture of theory and hands-on

Concepts

  • Molecular Markers (ex: SNPs)

  • Linkage Disequilibrium (LD)

  • Linear Regression

  • Confounding factor (false-positve)

Motivation

alt text

Intuition

Example

  • Hypothetical species with 3 chromosomes

  • Molecular markers along the genome

  • Quantitative trait ?

alt text

Intuition

  • What source of observation is known?

  • What source of observation is unknown?

alt text

Intuition

alt text

Intuition

  • The idea of associating between markers and genes; and assume that both segregate together across generations refers to the concept of linkage disequilibrium

  • If marker and gene are in equilibrium, the knowledge of the marker does not present any value to the selection.

Formal Definition

  • LD is the non-random association of alleles at different loci in a given population.

Be careful !!

  • LD and physical linkage are not synonymous !!!

  • LD is influenced by many factors, including selection, mutation, genetic drift, the system of mating, population structure and genetic linkage

Intuition

alt text

Toy example and two questions:

  • My trait: plant size

  • 12 homozygous individuals (AA or aa)

  • 3 SNPs genotyped in one chr.

  • Can I use markers to study the phenotypic variation observed?

  • What marker (SNP1, SNP2, or SNP3) ?

Intuition

  • We can use simple plots to visualize the problem.
trait = c(50,15,18,68,5,15,15,10,60,15,40,12)
snp1 = as.factor(c("AA","aa", "aa", "AA", "aa", "aa","aa","aa", "AA", "aa", "AA", "aa"))
snp2 = as.factor(c("AA","AA","AA","AA","AA","AA","aa","aa","aa","aa","aa","aa"))
par(mfrow=c(1,2))
plot(trait~snp1, main= "SNP1")
plot(trait~snp2, main="SNP2")

plot of chunk unnamed-chunk-1

Linear Regression

  • SNP1 may be associated with plant size.

  • How to formulate our hypothesis?

  • Linear Regression is a method that summarizes how the average values of an outcome variable (phenotypes) vary in function of the predictor variables (molecular markers).

  • Ex: height ~ body mass

plot of chunk unnamed-chunk-2

Linear Regression

Ordinary Least Square

  • Derive a estimator that minimize the prediction error variance, which is the expected value of the squared difference between predicted values and their unobserved true values

  • Intuitively: given that we are trying to predict an outcome using other variables, we want to do so in such a way as to minimize the error of our prediction

\[ y_i = \beta_0 + \beta_1 x_i + e_i \]

\[ \hat{\beta} = (X'X)^{-1}X'y \]

  • Test Hyphotesis:

\( H_0:\beta=0 \) \( H_1:\beta \neq 0 \)

alt text

Activity - Insert animated GIF to HTML

Linear Regression

summary(lm(trait~as.numeric(snp1)))$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)       -28.250   6.468433 -4.367364 1.404874e-03
as.numeric(snp1)   41.375   4.573873  9.045945 3.952776e-06
summary(lm(trait~as.numeric(snp2)))$coefficients
                  Estimate Std. Error   t value  Pr(>|t|)
(Intercept)      22.166667   20.60104 1.0759975 0.3072024
as.numeric(snp2)  3.166667   13.02924 0.2430431 0.8128848

plot of chunk unnamed-chunk-5

Simulated ExampleSimulated Example

# Read in data - simulated with 10 QTL
gwas=readRDS("gwasData.rds")
pheno=read.table("phenotypes.txt",
                 header=T,sep="\t")$Pheno
map=read.table("map.txt",header=T,sep="\t")
dim(gwas)
[1] 10000  2000
gwas[1:5,1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    2    0    1
[2,]    1    1    0    1    1
[3,]    1    2    1    2    1
[4,]    0    2    0    1    0
[5,]    1    0    0    0    2
map[1:3,]
  snp chrom loc
1   1     1   1
2   2     1   2
3   3     1   3

Simulated Example

Single SNP regression with lm

  • lm() R function implements the least squares to estimate the intercept and slope
effect=numeric(10000) # effect sizes (coefficients)
pval=numeric(10000) # p-values
for (i in 1:10000){
  res=coef(summary(lm(pheno~gwas[i,])))[2,c(1,4)]
  effect[i]=res[1]
  pval[i]=res[2]
}
effect[1:4]
[1] -0.4989296  0.1690589  0.1876288 -0.1043797
pval[1:4]
[1] 0.01431413 0.40415900 0.35829852 0.60490297

Simulated Example

  • Plot the SNP effects.
plot(effect,xlab="single SNP regressions",main="SNP effect estimates",pch=20, cex=3)
abline(v=QTL$indexQTL,col="gray")
points(QTL$indexQTL,QTL$QTLval,col="red",pch=17,cex=2)
legend("topright",c("single SNP","true QTL"),
       fil=c("black","red"),cex=0.8)

plot of chunk unnamed-chunk-10

Simulated Example

  • Manhattan plot
library(qqman)
library(RColorBrewer)
mypalette<-brewer.pal(9,"Set1")
df = data.frame(SNP=map$snp,CHR=map$chrom,BP=map$loc,P=pval)
manhattan(df,col=mypalette,suggestiveline=F,genomewideline=F,cex=2)

plot of chunk unnamed-chunk-11

Counfounding - Toy Example

  • 90 students in the classroom

  • Relationship between hair size and height of people

  • Statistical Model: hair_size = mean + height + error

plot of chunk unnamed-chunk-12

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 38.8703831 3.60057792 10.795596 8.503132e-18
hight       -0.1845235 0.02065016 -8.935693 5.568133e-14

Counfounding - Toy Example

  • We also took notes about sex.

  • On average, girls have longer hair and are shorter than boys.

plot of chunk unnamed-chunk-13

                      Estimate Std. Error    t value     Pr(>|t|)
(Intercept)        5.577631689 4.45579195  1.2517711 2.140084e-01
as.factor(cor)red  2.931686390 0.31974835  9.1687303 2.026746e-14
hight             -0.002799516 0.02474259 -0.1131456 9.101758e-01

Population Structure

alt text

alt text

alt text

Q+K GWAS model

  • Original model: y=mean + snp + error

  • Q+K model: y = mean + snp + pop.struc + kinship + error

Important points:

The Pulpit Rock

Example - Q+K model

# 1) Kinship matrix (rrBLUP, AGHmatrix, pedigreem and etc)
library(rrBLUP)
Imputation <- A.mat(geno.filtered,impute.method="EM",return.imputed=T,min.MAF=0.05)
K.mat <- Imputation$A ### KINSHIP matrix
K.mat[1:5,1:5]## view Kinship
          Oat1         Oat2       Oat3        Oat4         Oat5
Oat1 1.8191561  0.220454719  0.1009423  0.15218203  0.177660657
Oat2 0.2204547  2.111943356  0.1266988  0.02978199 -0.007214392
Oat3 0.1009423  0.126698834  1.8000414 -0.12678093 -0.126589635
Oat4 0.1521820  0.029781989 -0.1267809  1.87520005  1.803271696
Oat5 0.1776607 -0.007214392 -0.1265896  1.80327170  1.873595678

Example - Q+K model

  • M1 (Simple): y = mean + env + snp + error

  • M2 (Q model): y = mean + env + snp + PCA + error

  • M3 (K model): y = mean + env + snp + kinship + error

  • M4 (Q+K model): y = mean + env + snp + PCA + kinship + error

#2) PCA is computed internally 
# pheno = data.frame with phenotypes
# geno = genotype scored as -1,0,1
# K = pedigree matrix
# n.PC = number of principal components to include as fixed effects
library(rrBLUP)
gwasresults<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=NULL,n.PC=0)
gwasresults2<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=NULL,n.PC=6)
gwasresults3<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=K.mat,n.PC=0)
gwasresults4<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=K.mat,n.PC=6)

Example - Q+K model

  • QQ-plot: quantile distribution of observed p-values (on the y-axis) on the quantile distribution of expected p-values (on x-axis).

  • It is is a visual diagnostic tool that can identify the effects of accounted for population structure, relatedness and other issues in GWAS analysis

  • An example that pouplation structure effects were not strong

plot of chunk unnamed-chunk-19

Example - Q+K model

plot of chunk unnamed-chunk-20

Example - Q+K model

alt text

Other relevant topics in GWAS

Multiple Testing

  • Bonferroni

  • FDR

  • Permutation

GWAS models

  • Preprocessing and quality control

Important softwares (MLM models)

Important References